Application of untargeted liquid chromatography-mass spectrometry to routine analysis of food using three-dimensional bucketing and machine learning

For the detection of food adulteration, sensitive and reproducible analytical methods are required. Liquid chromatography coupled to high-resolution mass spectrometry (LC-HRMS) is a highly sensitive method that can be used to obtain analytical fingerprints consisting of a variety of different components. Since the comparability of measurements carried out with different devices and at different times is not given, specific adulterants are usually detected in targeted analyses instead of analyzing the entire fingerprint. However, this comprehensive analysis is desirable in order to stay ahead in the race against food fraudsters, who are constantly adapting their adulterations to the latest state of the art in analytics. We have developed and optimized an approach that enables the separate processing of untargeted LC‑HRMS data obtained from different devices and at different times. We demonstrate this by the successful determination of the geographical origin of honey samples using a random forest model. We then show that this approach can be applied to develop a continuously learning classification model and our final model, based on data from 835 samples, achieves a classification accuracy of 94% for 126 test samples from 6 different countries.


LC-HRMS analysis
The honey samples were analyzed using three LC-HRMS devices comprising Thermo Scientific™ UltiMate™ 3000 systems coupled to Thermo Scientific™ Q Exactive™ Hybrid Quadrupole-Orbitrap™ High-Resolution Mass Spectrometers.Each sample was measured with two different Liquid Chromatography (LC) approaches.Due to preliminary tests, hydrophilic interaction liquid chromatography (HILIC, Accucore-150-Amide-HILIC 150 × 2,1) in negative ion mode was applied to analyze polar compounds while non-polar compounds were analyzed by reverse phase (RP, Hypersil Gold C18* 150 × 2,1) chromatography in positive ion mode.This setup generated two different types of data for each sample, which will be referred to as HILIC and RP data.For both, the mobile phases water and acetonitrile with acetic acid as modifier were used.For the normalization strategy ISTD_add (see following subsection), a third mobile phase with 2% sorbic acid in positive ion mode (RP method) and 10% in negative ion mode (HILIC method) in an acetonitrile-water-mixture (50/50, v/v) with acetic acid as additive was isocratically injected.The different sorbic acid concentrations were chosen due to missing signals in the spectra obtained in negative ion mode.
The ionization of the chromatographically separated compounds was performed using electrospray ionization (ESI) and the data was acquired in profile mode.The mass spectrometric analysis was conducted using the variable data-independent acquisition (vDIA) method developed by Thermo Scientific™ Orbitrap™.vDIA is a data acquisition method that utilizes MS/MS precursor isolation windows of differing mass ranges, covering the entire mass range of the preceding full scan 33 .For the data analyzed in this study, the full scan comprised 100-1500 Da (MS1).This range was covered by six isolation windows, in which ions of the masses 150, 250, 350, 450, 750 and 1250 Da were fragmented respectively (MS2).The respective fragments were detected in the ranges 50-225, 50-330, 50-430, 50-535, 69-1045 and 104-1555 Da 4 .

Data processing with BOULS
The developed BOULS approach is schematically illustrated in Fig. 1 and published at https:// github.com/ AGSei fert/ BOULS (requires Linux OS).It is based on the established xcms workflow in R and uses the same functions for data import and chromatographic peak detection 34 .
For data import (step 1 in Fig. 1), the thermo-specific raw files are first converted to open-format mzML files using MSConvert, which is part of the ProteoWizard software package (Version: 3.0.21078-7da1f1136(developer build)).Here, the filter peakPicking is used to convert the profile mode data into centroided data 35 .Subsequently the data is imported into R using the Bioconductor package mzR [35][36][37][38][39] (version 2.26.0) storing information about the samples and analyses, such as name, geographical origin, instrument and method.The package MSnbase 40,41 (version 2.18.0) is then used to load and store the data in an object that is compatible with the xcms package [21][22][23] (version 3.14.0).The chromatographic peak detection (step 2 in Fig. 1) is performed by the centWave algorithm 23 using the parameters peakwidth of 15 s and ppm of 5 Da, which is individual for each experimental set up and which was set according to the extracted ion chromatogram of the internal standards.The retention time alignment (step 3 in Fig. 1) is performed by using the obiwarp method 42 , whereby the parameters binSize and localAlignment are set to 0.1 and TRUE, respectively.The spectrum with the most peaks was used as center spectrum.
Subsequently, the novel BOULS step (step 4 in Fig. 1) is conducted and the data is divided into 3 dimensional buckets summing up the total intensity of the signals (see Fig. 1b and c).The parameters of this step, namely the bucket size in the retention time (RT) and m/z dimensions, are optimized in a DOE, which is described and evaluated below.
Various approaches have been applied for normalization (step 5 in Fig. 1).For the first normalization strategy, the internal standard in the third mobile phase (ISTD_add) sorbic acid was used and the total intensity of an RTbucket was divided by the total intensity of all sorbic acid signal intensities.For total ion current (TIC) and base peak chromatogram (BPC) normalization, the summarized intensities of each bucket were divided by the sum of intensities of all signals and the most intensive signal, respectively.For TIC_RT and BPC_RT, the individual RT-buckets were considered separately and the intensity of a bucket was divided by either the sum (TIC_RT) Figure 1.(a) Overview of the novel data processing approach for the analysis of untargeted LC-HRMS data in routine analysis: A new step called BOULS replaces the correspondence step in previous workflows.To introduce this step, an example LC-HRMS spectrum after retention time alignment is shown in (b).Subsequently, the same spectrum after the application of BOULS summing up the signals in the respective buckets is depicted in (c).
or the maximum (BPC_RT) of the intensities of the respective RT-bucket.Also, the different normalization approaches were evaluated in a DOE.
For classification (step 6 in Fig. 1), RF was applied using the R package ranger 43 (version 0.14.1,CRAN) with the parameters ntree = 5000 and the respective default parameters for mtry and min.node.size(square root of the total number of variables and 1, respectively).To compensate for class imbalance, the parameter case.weights was chosen according to the size of the respective classes 43 .

Data
For the development, validation and application of the BOULS approach, four data sets compiled from samples of customers of a routine laboratory were used.As the successful analysis and comparison over a long period of time are crucial, different time periods were chosen for obtaining the training data and between obtaining the training and test data.For the analysis of each sample set, both, HILIC and RP, were used and the samples were measured in approximately equal proportions using one of the three instruments.

Data set 1
To optimize the BOULS approach parameters by a DOE (see section DOE1: Initial optimization), a data set obtained from 123 honey samples (246 spectra) from Argentina (18), Brazil (19), Canada (18), China (8), Ukraine (40) and the USA ( 20), was analyzed.The measurements were performed over a period of six weeks and the performance was evaluated by the OOB error of the obtained random forests.

Data set 2
To test the robustness of the developed approach over time, training and test data measured with a time offset of seven months were analyzed.For training, data set 1 was used.However, since they were not included in the test data, the data from the Chinese samples were omitted.The test set consisted of 27 spectra from 5, 5, 4, 6 and 7 samples from Argentina, Brazil, Canada, Ukraine and the USA, respectively.

Data set 3
The third data set was used for the initial implementation of the developed approach in routine analysis.For training, data of 565 samples from Argentina (93), Brazil (65), Canada (9), India (69), Ukraine (250) and the USA (79) were used.The test set consisted of 126 spectra from 8, 11, 65, 1, 13 and 28 samples from Argentina, Brazil, India, Canada, Ukraine, and the USA, respectively.

Data set 4
The fourth data set was used to analyze, whether the prediction accuracy improves by extending the model in long-term application.The extended training data set contained data from 835 samples from Argentina (170), Brazil (133), Canada (16), India (126), Ukraine (264), and the USA (126).The resulting model was used to classify the same test data as in the previous section to compare the performance of the two models directly.

Design of experiment (DOE) for the optimization of the BOULS parameters
In order to optimize the parameters of the BOULS approach, two full factorial DOEs described in the following two subsections were conducted.In each case, the expand.gridfunction of the R base package 44 (version 4.2.2) was used to generate a data frame with all possible combinations of the factor levels.After the execution of the workflow, Analysis of Variance (ANOVA) (aov function of the base stats package) in combination with Tukey's honest significant difference (TukeyHSD function in base R package) was applied to compare the different factors and calculate p-values.

DOE1: Initial optimization
Table 1 shows the applied factors together with their respective levels of the first DOE.Bucket sizes for the RT and m/z dimension, as well as the different normalization approaches introduced in the previous section were evaluated.

DOE2: Validation and optimization for long-term application
In order to establish the BOULS approach for long-time application, a second DOE was applied with the factors and levels shown in Table 2.In addition to the already applied factors in DOE 1, which were validated for long-term application, also the addition of fragment (MS2) data and the combination of HILIC and RP data was evaluated.For the former, the fragment RP data was processed with the BOULS step with either small buckets (5 Da and 5 s) or large buckets (20 Da and 80 s) and combined with the full RP data or not.Since the individual isolation windows vary in size (see section LC-HRMS analysis), the exact size of the buckets was adjusted to achieve divisibility.For the latter, the HILIC data was either included with or without the fragment data that was processed as just described or not.

Results and discussion
This work presents BOULS, a data processing workflow that allows the application of untargeted LC-HRMS for routine analysis.In the following sections, this approach will be applied for the analysis of data from different devices and the BOULS parameters will be analyzed and optimized (first subsection).Subsequently, the longterm applicability with optimized parameters will be demonstrated (second subsection), and an approach for the implementation in routine analysis is shown (third subsection).

Analysis of data from different devices
For the analysis and optimization of the BOULS parameters, a full factorial DOE for both HILIC and RP data was applied and the results are shown in Table 3.The size of the buckets was optimized for the RT and mass dimension.The analysis of the RT dimension showed that the largest possible bucket size should be used for the HILIC data as significantly lower OOB errors were achieved using the size 80 s compared to 20 s (p < 0.01), 80 s compared to 5 s (p < 0.001) and 20 s compared to 5 s (p < 0.01).On the contrary, the results for the RP data showed that the lowest possible bucket size should be used because significantly lower OOB errors were achieved using the size 20 s compared to 80 s (p < 0.01) and 5 s compared to 80 s (p < 0.001).In principle, the determination of the RT bucket size involves a compromise between compensation for technical differences between the measurements, mainly caused by the use of different devices, and a greater resolution of the information contained in the spectra (small bucket size).The former seems to be more important for the HILIC data, while the latter is more important for the RP data.This confirms the property that HILIC chromatography is less robust than RP chromatography 45 .For the optimization of the bucket size in mass dimension, small values were advantageous for the HILIC data since significantly smaller OOB errors were achieved using the level 2 Da compared to 10 Da (p < 0.05), 2 Da compared to 20 Da (p < 0.001), 5 Da compared to 20 Da (p < 0.001) and 10 Da compared to 20 Da (p < 0.01).For the RP data, also smaller values for this bucket size showed smaller OOB errors but the differences were not significant.The general preference of smaller values for this parameter can be explained by the fact that a greater resolution of the signals results in models based on more detailed information, which improves the classification.Deviations in the mass signals are not present in the same extend as in the retention time and must therefore not be compensated for by high values for this bucket size.The comparison of the individual normalization approaches shows that the use of an internal standard in ISTD_add generally results in significantly higher OOB errors for the RP data (p < 0.001 for all comparisons) and higher or significantly higher errors for the HILIC data.Since this normalization is obviously inferior to the others, it was neglected for the workflow and the following analyses.The other normalization approaches mostly do not show significant differences, which is why all of them were further evaluated.Table 4 shows the confusion matrix for the classification of the geographical origin of honey using the best parameter combination for each of the two data sets.Classification based on the RP data results in more accurate results with an overall OOB-error of 5% compared to 10% using the HILIC data.This is particularly evident in the classification of samples from the USA and Brazil and is probably due to the lower robustness of the HILIC method.
The overall high classification accuracy demonstrates the general applicability of our approach using bucketed spectra of different devices over six weeks, a comparatively short period of time.In the next section, we will further optimize data processing parameters and analysis for the application of the approach over a longer period of time.

Validation and optimization of the BOULS approach for long-term application
For practical application of machine learning approaches in routine analysis, a model is usually trained with currently available data and applied to data that is obtained at a much later point in time from samples that are then to be classified.Therefore, we evaluated BOULS and its parameters under these conditions in a second DOE.In addition to the parameters of the bucket sizes and normalization strategies, we also analyzed here, whether Table 3. Results of the initial optimization of the BOULS parameters for the HILIC and RP data based on DOE1.The respective level reaching the lower OOB error is in italics and bold for the analysis of HILIC and RP data, respectively.The third and fourth columns show the differences between the averaged classification accuracies for the HILIC and RP data, respectively.*p < 0.05.**p < 0.01.***p < 0.001.a fusion of the RP data with the HILIC data and an addition of the MS2 fragment data would be advantageous for long-term application.The results are summarized in Table 5. Analysis of bucket size showed, that there were no significant differences between the bucket sizes in the RT dimension while at the mass dimension the bucket size of 2 Da resulted in significantly lower classification errors (p < 0.05).For the evaluation of the normalization, using the sum of all signal intensities (TIC) significantly outperformed all of the other approaches (p < 0.001).No significant differences were observed for adding RP MS2 data (factor: fragments) or HILIC data with or without fragment data (factor: data fusion), even though there were differences in classification error of up to 5%.Based on these results, there would be two options for selecting the optimum parameters for further application.First, with view to computational effort and the aim of having a model that is as simple as possible, only RP data could be used.However, since we observed differences that could become significant in the long run, especially when data of new classes are added, we decided to evaluate the parameter combinations that provide the best classification results (see Supplementary Table S1).
Here it can be seen that both the lowest error for the test data of 15% and for the training data of 7% is achieved with the parameter combination, using bucket sizes of 20 s and 2 Da for the RP MS1 data, a TIC normalization, large buckets of 80 s and 20 Da for the RP MS2 fragment and to include the HILIC MS1 data with equal bucket sizes as for RP MS1 and with HILIC MS2 data with small buckets of 5 s and 5 Da (confusion matrix of the test data in Supplementary Table S2).In general, the high classification accuracy shows that the use of LC-HRMS over a longer period of time is possible in a routine laboratory if the BOULS approach is applied.

Implementation in routine analysis
Having demonstrated the general applicability of BOULS for analyzing LC-HRMS data from multiple devices over a long period of time and defined a parameter combination for the application, we will now use the example of determining the geographical origin of honey to show how it can be used for routine analysis of food in a commercial laboratory.The general procedure is shown in Fig. 2.
An initial dataset is processed with BOULS and used to train an RF classification model.Subsequently, data of new samples with unknown geographic origin are processed separately and, the geographic origin of these test data is predicted by the model.The results of the prediction are compared to the results of reference methods like pollen analysis to evaluate the model.The data from the sample, including the correct class assignment, is then added to the training data, which is very simple because the data is processed into equal buckets with BOULS instead of using the correspondence step applied in other workflows.Finally, a new RF model based on the extended training data is trained and used for the prediction of the subsequent samples.In this way, continuous learning models are developed that increasingly reflect the entire variance of the relevant groups.This is shown in the example in Table 6, where we predicted the same test data with a growing model.The improved model, based on the larger training data set, achieved an 11% higher overall accuracy, mainly due to a better prediction of the American and Brazilian samples.
However, the only Canadian samples is still not classified correctly when the extended model is used.Since also the extended model is only based on 16 Canadian samples, it is reasonable to assume that a correct

Conclusions
In this study, we have presented the BOULS approach for LCMS data processing making it possible to use data from samples measured on different devices and at different times.This is achieved through the generation of a standardized data structure and enables the training of large machine learning models for routine analysis in commercial laboratories.We have demonstrated this application here for the determination of the geographical origin of honey, which is a highly contaminating matrix for LCMS analyses.Since this characteristic strongly influences the comparability of the spectra and the use of BOULS obviously makes this possible anyway, we assume that this approach can be applied to a wide range of different foods and other biological samples.Most of the parameters of the workflow proved to be quite robust in our comparisons, which greatly simplifies this expansion of the application.Furthermore, it is also possible to extend the approach to analyze liquid chromatography-ion mobility spectrometry-mass spectrometry (LC-IMS-MS) data, thereby incorporating even more information into the classification models.

Figure 2 .
Figure 2. Schematic overview illustration of the application of the BOULS approach in routine analysis.

Table 1 .
Factors and levels of DOE 1 for the initial optimization of applied parameters of the BOULS approach.

Table 2 .
Factors and levels of the second DOE for validation and optimization for long-term application of the BOULS approach.

Table 4 .
Confusion matrix for the RF model using HILIC (first value) and RP (second value, bold) data achieving an OOB error of 10 and 5%, respectively.The HILIC data was processed with the bucket sizes 80 s and 5 Da and the normalization strategy TIC, while the RP data was processed with the bucket sizes 5 s and 2 Da and the normalization strategy TIC.

Table 5 .
Results of DOE 2 for long-term application of BOULS.The level reaching the lower classification error for the test data is shown in bold in the second column and the third column shows the differences between the classification errors.*p < 0.05.**p < 0.01.***p < 0.001.classification can also be achieved here if the entire variance of this class is represented by a correspondingly higher number of samples.

Table 6 .
Confusion matrix for the prediction of the test data with the initial RF model based on data from 565 samples (first value) and the extended RF model based on data from 835 samples (second value, bold) achieving an overall accuracy of 83% and 94%, respectively.